Mining All Non-derivable Frequent Itemsets

نویسندگان

  • Toon Calders
  • Bart Goethals
چکیده

Recent studies on frequent itemset mining algorithms resulted in significant performance improvements. However, if the minimal support threshold is set too low, or the data is highly correlated, the number of frequent itemsets itself can be prohibitively large. To overcome this problem, recently several proposals have been made to construct a concise representation of the frequent itemsets, instead of mining all frequent itemsets. The main goal of this paper is to identify redundancies in the set of all frequent itemsets and to exploit these redundancies in order to reduce the result of a mining operation. We present deduction rules to derive tight bounds on the support of candidate itemsets. We show how the deduction rules allow for constructing a minimal representation for all frequent itemsets. We also present connections between our proposal and recent proposals for concise representations and we give the results of experiments on real-life datasets that show the effectiveness of the deduction rules. In fact, the experiments even show that in many cases, first mining the concise representation, and then creating the frequent itemsets from this representation outperforms existing frequent set mining algorithms.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Depth-First Non-Derivable Itemset Mining

Mining frequent itemsets is one of the main problems in data mining. Much effort went into developing efficient and scalable algorithms for this problem. When the support threshold is set too low, however, or the data is highly correlated, the number of frequent itemsets can become too large, independently of the algorithm used. Therefore, it is often more interesting to mine a reduced collecti...

متن کامل

Closed Non-derivable Itemsets

Itemset mining typically results in large amounts of redundant itemsets. Several approaches such as closed itemsets, non-derivable itemsets and generators have been suggested for losslessly reducing the amount of itemsets. We propose a new pruning method based on combining techniques for closed and non-derivable itemsets that allows further reductions of itemsets. This reduction is done without...

متن کامل

Non-Derivable Item Set and Non-Derivable Literal Set Representations of Patterns Admitting Negation

The discovery of frequent patterns has attracted a lot of attention of the data mining community. While an extensive research has been carried out for discovering positive patterns, little has been offered for discovering patterns with negation. The main hindrance to the progress of such research is huge amount of frequent patterns with negation, which exceeds the number of frequent positive pa...

متن کامل

Data sanitization in association rule mining based on impact factor

Data sanitization is a process that is used to promote the sharing of transactional databases among organizations and businesses, it alleviates concerns for individuals and organizations regarding the disclosure of sensitive patterns. It transforms the source database into a released database so that counterparts cannot discover the sensitive patterns and so data confidentiality is preserved ag...

متن کامل

Recent Developments in Pattern Mining

Pattern Mining is one of the most researched topics in the data mining community. Literally hundreds of algorithms for efficiently enumerating all frequent itemsets have been proposed. These exhaustive algorithms, however, all suffer from the pattern explosion problem. Depending on the minimal support threshold, even for moderately sized databases, millions of patterns may be generated. Althoug...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002